Feature Transform

Load Audio

openspeech.data.audio.load.load_audio(audio_path: str, sample_rate: int, del_silence: bool = False) → numpy.ndarray

Load an audio file (PCM) as a signal array. If del_silence is True, all sound below 30 dB is removed. If an exception occurs in numpy.memmap(), None is returned.
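The PCM-loading behaviour described above can be sketched with NumPy alone. This is a minimal illustration, not openspeech's implementation: the helper name load_pcm_sketch, the 16-bit little-endian sample format, and the scaling by 32768 are assumptions, and silence removal is left out.

```python
import os
import tempfile
from typing import Optional

import numpy as np


def load_pcm_sketch(audio_path: str) -> Optional[np.ndarray]:
    """Read headerless 16-bit PCM into a float32 array in [-1, 1).

    Hypothetical helper: returns None if numpy.memmap() raises,
    mirroring the behaviour described for load_audio.
    """
    try:
        # Map the file as signed 16-bit integers without reading it eagerly.
        raw = np.memmap(audio_path, dtype="<i2", mode="r")
    except (OSError, ValueError):
        return None
    # Copy into memory as float32 and scale to [-1, 1).
    return np.asarray(raw, dtype=np.float32) / 32768.0


# Usage: write three samples to a temporary file and read them back.
samples = np.array([0, 16384, -32768], dtype="<i2")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pcm")
    samples.tofile(path)
    signal = load_pcm_sketch(path)
    missing = load_pcm_sketch(os.path.join(tmp, "missing.pcm"))
```

In this sketch a missing or empty file yields None rather than an exception, so callers would need to check the result before using it.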

Spectrogram Feature Transform

class openspeech.data.audio.spectrogram.spectrogram.SpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)

Create a spectrogram from an audio signal.

Parameters: configs (DictConfig) – configuration set
Returns: A spectrogram feature. The shape is (seq_length, num_mels)
Return type: Tensor
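To make the (seq_length, num_mels) output shape concrete, here is a minimal NumPy sketch of a log-magnitude spectrogram. It is not openspeech's implementation: the function name, the Hann window, the log1p compression, and the window and hop sizes of 320 and 160 samples (20 ms and 10 ms at 16 kHz) are assumptions made for illustration.

```python
import numpy as np


def spectrogram_sketch(signal: np.ndarray, n_fft: int = 320, hop: int = 160) -> np.ndarray:
    """Return a log-magnitude spectrogram of shape (seq_length, n_fft // 2 + 1)."""
    if len(signal) < n_fft:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(num_frames)])
    # One-sided FFT per frame, then compress the magnitudes.
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))


# Usage: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
feature = spectrogram_sketch(np.sin(2 * np.pi * 440.0 * t))
print(feature.shape)  # (99, 161)
```

With a 320-sample window the one-sided FFT has 161 bins, so the feature dimension here is 161.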

Spectrogram Feature Transform Configuration

class openspeech.data.audio.spectrogram.configuration.SpectrogramConfigs(name: str = 'spectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 161, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)

This is the configuration class that stores the configuration of a SpectrogramFeatureTransform.

It is used to instantiate a SpectrogramFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.

Parameters

name (str) – name of the feature transform (default: spectrogram)
sample_rate (int) – sampling rate of the audio (default: 16000)
frame_length (float) – frame length for the spectrogram (default: 20.0)
frame_shift (float) – hop length between successive STFT frames (default: 10.0)
del_silence (bool) – whether to delete silence (default: False)
num_mels (int) – size of the output feature dimension (default: 161)
apply_spec_augment (bool) – whether to apply SpecAugment (default: True)
apply_noise_augment (bool) – whether to apply noise augmentation (default: False)
apply_time_stretch_augment (bool) – whether to apply time-stretch augmentation (default: False)
apply_joining_augment (bool) – whether to apply audio-joining augmentation (default: False)
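As a rough illustration of how these defaults relate to one another, the sketch below mirrors a few of the fields in a stand-in dataclass (not the real SpectrogramConfigs) and derives window and hop sizes in samples. It assumes that frame_length and frame_shift are given in milliseconds, which the parameter list does not state; the helper derived_sizes is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpectrogramConfigsSketch:
    """Stand-in mirroring a subset of the documented fields and defaults."""
    sample_rate: int = 16000
    frame_length: float = 20.0  # assumed to be milliseconds
    frame_shift: float = 10.0   # assumed to be milliseconds
    num_mels: int = 161


def derived_sizes(cfg: SpectrogramConfigsSketch) -> dict:
    """Convert the time-based settings into sample counts."""
    n_fft = int(round(cfg.sample_rate * 0.001 * cfg.frame_length))
    hop_length = int(round(cfg.sample_rate * 0.001 * cfg.frame_shift))
    return {
        "n_fft": n_fft,
        "hop_length": hop_length,
        "freq_bins": n_fft // 2 + 1,  # size of a one-sided FFT
    }


sizes = derived_sizes(SpectrogramConfigsSketch())
print(sizes)  # {'n_fft': 320, 'hop_length': 160, 'freq_bins': 161}
```

Under the millisecond assumption, the 161 one-sided FFT bins of a 320-sample window coincide with the num_mels default of 161.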

Mel-Spectrogram Feature Transform

class openspeech.data.audio.melspectrogram.melspectrogram.MelSpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)

Create a mel-spectrogram from a raw audio signal. This is a composition of Spectrogram and MelScale.

Parameters: configs (DictConfig) – configuration set
Returns: A mel-spectrogram feature. The shape is (seq_length, num_mels)
Return type: Tensor
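The composition of Spectrogram and MelScale mentioned above can be pictured as a matrix product: a power spectrogram of shape (seq_length, n_freqs) is projected onto mel bins by a filter-bank matrix of shape (n_freqs, num_mels). The NumPy sketch below is an illustration under assumptions, not openspeech's code: the helper names, the HTK-style mel formula, the triangular filters, and the random stand-in spectrogram are all chosen for the example.

```python
import numpy as np


def hz_to_mel(hz):
    """HTK-style Hz to mel conversion."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank(n_freqs: int, n_mels: int, sample_rate: int) -> np.ndarray:
    """Triangular mel filters as a matrix of shape (n_freqs, n_mels)."""
    freqs = np.linspace(0.0, sample_rate / 2.0, n_freqs)
    # n_mels + 2 points equally spaced on the mel scale give the filter edges.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2))
    lower, centre, upper = edges[:-2], edges[1:-1], edges[2:]
    rising = (freqs[:, None] - lower) / (centre - lower)
    falling = (upper - freqs[:, None]) / (upper - centre)
    # Each filter is the positive part of the smaller of its two slopes.
    return np.clip(np.minimum(rising, falling), 0.0, None)


# Step 1 (Spectrogram): stand-in power spectrogram, 99 frames x 161 frequency bins.
rng = np.random.default_rng(0)
power_spec = rng.random((99, 161))

# Step 2 (MelScale): project the frequency axis onto 80 mel bins.
fbank = mel_filterbank(n_freqs=161, n_mels=80, sample_rate=16000)
mel_spec = power_spec @ fbank
print(mel_spec.shape)  # (99, 80)
```

The time axis is untouched by the projection; only the feature dimension changes, from the number of frequency bins to num_mels.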

Mel-Spectrogram Feature Transform Configuration

class openspeech.data.audio.melspectrogram.configuration.MelSpectrogramConfigs(name: str = 'melspectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 80, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)

This is the configuration class that stores the configuration of a MelSpectrogramFeatureTransform.

It is used to instantiate a MelSpectrogramFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.

Parameters

name (str) – name of the feature transform (default: melspectrogram)
sample_rate (int) – sampling rate of the audio (default: 16000)
frame_length (float) – frame length for the spectrogram (default: 20.0)
frame_shift (float) – hop length between successive STFT frames (default: 10.0)
del_silence (bool) – whether to delete silence (default: False)
num_mels (int) – number of mel bins in the output feature (default: 80)
apply_spec_augment (bool) – whether to apply SpecAugment (default: True)
apply_noise_augment (bool) – whether to apply noise augmentation (default: False)
apply_time_stretch_augment (bool) – whether to apply time-stretch augmentation (default: False)
apply_joining_augment (bool) – whether to apply audio-joining augmentation (default: False)

Filter-Bank Feature Transform

class openspeech.data.audio.filter_bank.filter_bank.FilterBankFeatureTransform(configs: omegaconf.dictconfig.DictConfig)

Create filter-bank (fbank) features from a raw audio signal. This matches the input/output of Kaldi’s compute-fbank-feats.

Parameters: configs (DictConfig) – Hydra configuration set

Inputs: signal (np.ndarray) – signal from an audio file

Returns: A fbank identical to what Kaldi would output. The shape is (seq_length, num_mels)
Return type: Tensor

Filter-Bank Feature Transform Configuration

class openspeech.data.audio.filter_bank.configuration.FilterBankConfigs(name: str = 'fbank', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 80, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)

This is the configuration class that stores the configuration of a FilterBankFeatureTransform.

It is used to instantiate a FilterBankFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.

Parameters

name (str) – name of the feature transform (default: fbank)
sample_rate (int) – sampling rate of the audio (default: 16000)
frame_length (float) – frame length for the spectrogram (default: 20.0)
frame_shift (float) – hop length between successive STFT frames (default: 10.0)
del_silence (bool) – whether to delete silence (default: False)
num_mels (int) – number of mel filter-bank bins in the output feature (default: 80)
apply_spec_augment (bool) – whether to apply SpecAugment (default: True)
apply_noise_augment (bool) – whether to apply noise augmentation (default: False)
apply_time_stretch_augment (bool) – whether to apply time-stretch augmentation (default: False)
apply_joining_augment (bool) – whether to apply audio-joining augmentation (default: False)

MFCC Feature Transform

class openspeech.data.audio.mfcc.mfcc.MFCCFeatureTransform(configs: omegaconf.dictconfig.DictConfig)

Create the Mel-frequency cepstrum coefficients from an audio signal.

By default, this calculates the MFCC on the dB-scaled mel-spectrogram. This is not the textbook implementation, but it is implemented this way for consistency with librosa.

The output depends on the maximum value in the input spectrogram, so it may return different values for an audio clip split into snippets than for the full clip.

Parameters: configs (DictConfig) – configuration set
Returns: An MFCC feature. The shape is (seq_length, num_mels)
Return type: Tensor
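A dependence on the input maximum of this kind can arise when the decibel conversion clips values at a fixed range below the loudest value. The sketch below illustrates that effect only; the function name, the 80 dB range (top_db) and the floor amin are assumptions chosen for the example, not values taken from openspeech.

```python
import math
from typing import List


def power_to_db_sketch(power: List[float], top_db: float = 80.0,
                       amin: float = 1e-10) -> List[float]:
    """Convert power values to dB, clipping at top_db below the loudest value."""
    db = [10.0 * math.log10(max(p, amin)) for p in power]
    floor = max(db) - top_db
    return [max(d, floor) for d in db]


full_clip = [1.0, 1e-9]   # a loud frame followed by a very quiet one
snippet = full_clip[1:]   # the quiet frame processed on its own

# Within the full clip the quiet frame is raised to the floor set by the loud frame.
quiet_in_full = power_to_db_sketch(full_clip)[1]
# On its own there is no louder frame, so the value is not clipped.
quiet_alone = power_to_db_sketch(snippet)[0]
print(quiet_in_full, quiet_alone)
```

Here the same quiet frame comes out at about -80 dB inside the full clip and about -90 dB when processed alone, so any coefficients computed afterwards would differ as well.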

MFCC Feature Transform Configuration

class openspeech.data.audio.mfcc.configuration.MFCCConfigs(name: str = 'mfcc', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 40, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)

This is the configuration class that stores the configuration of an MFCCFeatureTransform.

It is used to instantiate an MFCCFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.

Parameters

name (str) – name of the feature transform (default: mfcc)
sample_rate (int) – sampling rate of the audio (default: 16000)
frame_length (float) – frame length for the spectrogram (default: 20.0)
frame_shift (float) – hop length between successive STFT frames (default: 10.0)
del_silence (bool) – whether to delete silence (default: False)
num_mels (int) – number of MFC coefficients to retain (default: 40)
apply_spec_augment (bool) – whether to apply SpecAugment (default: True)
apply_noise_augment (bool) – whether to apply noise augmentation (default: False)
apply_time_stretch_augment (bool) – whether to apply time-stretch augmentation (default: False)
apply_joining_augment (bool) – whether to apply audio-joining augmentation (default: False)